Journal of Cheminformatics — Latest Matching Preprints

1

BBB-Nuke: Transport-Aware Prediction of Blood-Brain Barrier Penetration in Small Molecules

Abasciano, N.; Hadipour, H.; Poddar, A.; Rudrum, J.; Sobodu, T.

2026-07-14 bioengineering 10.64898/2026.07.13.738280 medRxiv

Top 0.1%

35.9%

Predicting blood-brain barrier (BBB) penetration remains a central challenge in CNS drug discovery. Existing computational models rely on physicochemical descriptors and are blind to active transport biology - the efflux pumps and carrier proteins that dominate drug exclusion at the BBB in vivo. We present BBB-Nuke, a modular prediction pipeline that integrates physicochemical scoring with explicit efflux transporter substrate modeling. The system computes ten molecular descriptors, predicts ionization state via a graph convolutional network, scores CNS-MPO desirability, and estimates substrate probability for seven efflux transporters (P-gp/MDR1, BCRP/ABCG2, MRP1, MRP2, MRP4, MATE1, OAT3) using Random Forest classifiers trained on curated ChEMBL bioactivity data. A gradient-boosted classifier trained on 67 features - ten physicochemical, seven efflux transporter probabilities, and fifty fingerprint-derived principal components - achieves an area under the receiver operating characteristic curve (AUROC) of 0.933 ± 0.006 under five-fold cross-validation on 9,262 labeled compounds, and 0.810 on a fully held-out benchmark of 470 clinically validated compounds. In head-to-head comparisons, BBB-Nuke outperforms CNS-MPO, LightBBB, ADMETlab 2.0, and BBB-Score on both cross-validation and external test sets. We apply the pipeline to screen over one billion commercially available compounds from the Enamine REAL library and PubChem, identifying enriched regions of BBB-penetrant chemical space and characterizing the structural features that distinguish permeable from excluded molecules. BBB-Nuke is freely available as a Python package, REST API, and Model Context Protocol server.

2

BBBP-Atlas: Unified Interpretable Modeling of Blood Brain Barrier Permeability across Small Molecules and Peptides

Shen, X.; Su, Q.; Luo, H.; Gou, Q.; Ge, J.; Hou, T.; Wang, J.; Kang, Y.

2026-07-09 bioinformatics 10.64898/2026.07.06.736742 medRxiv

Top 0.1%

13.1%

Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system drug discovery, yet existing models are often limited by their reliance on predefined physicochemical descriptors, small-molecule-centered training sets, or conformation-dependent representations, which restricts their transferability across chemically diverse modalities, especially peptides. In addition, publicly available BBBP datasets remain fragmented, inconsistently standardized, and weakly controlled for molecular redundancy, increasing the risk of data leakage and overestimated model performance. In this study, we propose BBBP-Atlas, a structure-aware BBB permeability prediction model designed for unified modeling of small molecules and peptides, together with OmniBBBP, the first cross-modal dataset. Designed to bypass descriptor and conformation dependencies, our model represents standardized molecular structures as atom-level graphs to capture local atom-bond environments and long-range topological dependencies associated with BBB transport. This design enables direct learning of structure-permeability relationships from molecular topology. For model training and evaluation, we curated a cross-modal, redundancy-filtered database, OmniBBBP, that unifies small molecules and complex peptides and contains 10,218 unique compounds (9,316 small molecules and 902 peptides). BBBP-Atlas achieved an accuracy of 0.8914 and an MCC of 0.7678 on the independent test set. On a balanced external benchmark of 200 compounds, our model reached an AUC of 0.9108, an accuracy of 0.8500, and an MCC of 0.7000, outperforming LightBBB by an absolute MCC gain of 6%. Case studies further showed that BBBP-Atlas captured clinically meaningful BBB permeability patterns, correctly identifying lorlatinib as BBB-permeable and vancomycin as BBB-impermeable with high confidence.
The OmniBBBP-backed BBBP-Atlas offers a versatile and cross-modal approach for single-compound prediction, batch screening, and dataset exploration for CNS drug discovery. BBBP-Atlas is available at https://cadd.drugflow.com/bbbp/.
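
The accuracy and MCC reported for the balanced 200-compound benchmark can be related with a short worked example. The sketch below is only illustrative: the confusion-matrix counts are a hypothetical assumption (the abstract does not report them), chosen as 100 permeable and 100 impermeable compounds with 15 errors in each class. Under that assumption the two reported figures are mutually consistent.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Hypothetical counts (NOT from the abstract): balanced classes,
# 15 misclassified compounds on each side.
acc, mcc = confusion_metrics(tp=85, tn=85, fp=15, fn=15)
print(f"accuracy={acc:.4f}, MCC={mcc:.4f}")  # accuracy=0.8500, MCC=0.7000
```

On a balanced set with symmetric errors, MCC reduces to 2 x accuracy - 1, which is why 0.85 and 0.70 appear together; any other split of the 30 errors would give a slightly different MCC.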

3

PredHLM: quantitative and interpretable prediction of metabolic half-life in human liver microsomes

Jang, J.; Cho, N.-C.; Oh, K.-S.

2026-07-08 bioinformatics 10.64898/2026.07.02.736062 medRxiv

Top 0.1%

12.3%

Motivation: Human liver microsome (HLM)-based metabolic stability assays are fundamental in early drug discovery, shaping pharmacokinetic profiles and oral bioavailability. However, these experimental assays are labor-intensive and time-consuming, limiting their application in large-scale virtual screening. Computational models can prioritize compounds at scale, yet most are classification-based, leaving quantitative and interpretable prediction of HLM half-life limited. Results: In this study, we developed a quantitative machine learning model for the direct prediction of HLM half-life (T1/2) by integrating 11,790 compounds combining in-house and curated public data. Among various combinations of molecular features and learning algorithms, the XGBoost model with RDKit 2D descriptors achieved the best predictive performance, with an RMSE of 0.507 and an R2 of 0.431 on an independent test set. Shapley Additive Explanations (SHAP) analysis identified lipophilicity and known metabolic soft-spot features as the primary contributors to the predictions. These results suggest that this quantitative approach provides a practical framework for defining metabolic stability margins, thereby supporting rapid Go/No-go decisions in preclinical drug discovery. Availability: The source code, data, and trained model are available at https://github.com/joshua-416/PredHLM.

4

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 0.1%

10.9%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

5

A control-validated pan-proteome deep-learning pipeline nominates GPR35 as a candidate target of the orphan bacterial metabolite ligiamycin A

Martin, J.

2026-07-06 bioinformatics 10.64898/2026.07.01.735807 medRxiv

Top 0.1%

8.2%

Most microbial natural products with documented bioactivity lack an identified molecular target, which limits their development. We present an open, control-validated computational pipeline for natural-product target hypothesis generation. It combines a pan-proteome deep-learning drug-target interaction (DTI) model (a graph neural-network ligand encoder, an ESM-2 protein language-model encoder, and bidirectional cross-attention) with bias-corrected ranking and control-anchored molecular docking. Applying it to ligiamycin A, a 2022-described Streptomyces/Achromobacter co-culture decalin-amino-maleimide with no reported target, we find that the predicted interactions of the compound are dominated by class-A G-protein-coupled receptors. Using a drug with a known target (losartan) we identify and correct a frequent-hitter bias in the raw model; after correction the standout candidates are uniformly class-A GPCRs, led by the orphan receptor GPR35. Structure-based docking with matched positive and negative controls across three candidates corroborates GPR35 specifically: ligiamycin A scores comparably to the known GPR35 agonist zaprinast at the agonist pocket (-8.1 vs -8.3 kcal/mol; non-binder floor -5.5), whereas FFAR1 is excluded and histamine H2 is inconclusive. We propose GPR35 as a prioritized, experimentally testable target and release the workflow as a reusable tool. The result is a computational hypothesis that requires experimental validation.

6

MolMAE: A Surface-Centric Multimodal Masked Autoencoder for Molecular Representation Learning

Li, J.

2026-07-14 bioinformatics 10.64898/2026.07.11.737987 medRxiv

Top 0.1%

6.8%

Molecular representation learning has become a central component of modern computational drug discovery. Existing molecular foundation models mainly rely on SMILES strings, two-dimensional molecular graphs, or three-dimensional atomic coordinates. However, many molecular properties are ultimately governed by the molecular surface, where intermolecular recognition, solvation, electrostatic complementarity, and ligand-protein interactions occur. In this work, we propose MolMAE, a surface-guided multimodal masked autoencoder for molecular representation learning. MolMAE takes molecular surface point clouds, three-dimensional molecular graphs, and SMILES-derived fragment and functional-group tokens as complementary input modalities, and learns a unified multimodal molecular embedding through functional-group-aligned masked autoencoding. During pretraining, chemically corresponding local regions are jointly masked across surface, graph, fragment, and functional-group views, forcing the model to reconstruct missing geometric, physicochemical, structural, and semantic information from the remaining context. While molecular surface reconstruction serves as the primary pretraining objective, graph-, fragment-, and functional-group-level reconstruction tasks provide complementary supervision that encourages the model to capture molecular topology, bonding patterns, stereochemistry, local chemical environments, and substructure organization. In addition to reconstructing surface geometry, MolMAE reconstructs surface-associated physicochemical fields, including electrostatic potential and Fukui-related descriptors, enabling the model to learn chemically meaningful surface representations. Pretrained on approximately 261K lead-like bioactive molecules, MolMAE achieves strong performance on the ESOL benchmark under scaffold splitting and competitive performance across multiple molecular property prediction tasks. 
These results suggest that molecular surface-guided pretraining can complement conventional graph-, sequence-, and atom-coordinate-based molecular representations, especially for property prediction tasks influenced by exposed surface geometry and surface-associated physicochemical patterns.

7

ConfDock: Atom-specific Uncertainty Quantification for Molecular Docking via Conformal Prediction

Hao, H.; Elhendawy, N.; Wang, Y.; Lu, C.

2026-07-01 biochemistry 10.64898/2026.06.29.735353 medRxiv

Top 0.1%

6.8%

Molecular docking is widely used in structure-based drug discovery, yet most approaches provide point estimates without rigorous uncertainty quantification. This limitation makes it difficult to assess when a predicted pose should be trusted, especially when docking methods are applied to diverse protein-ligand systems. We present ConfDock, a conformal prediction (CP) framework for constructing atom-specific prediction intervals for ligand docking poses. ConfDock combines graph neural network (GNN) based quantile estimation with split conformal calibration, producing intervals that adapt to local protein-ligand environments while retaining distribution-free finite-sample coverage guarantees. We evaluate ConfDock on 238 protein-ligand complexes across four docking methods representing distinct computational paradigms. The proposed approach yields substantially narrower prediction intervals compared to standard split CP (57.2% average reduction in mean interval width, up to 74.5%) while maintaining target coverage across all evaluated settings. Ablation analysis indicates that the GNN captures the dominant structure-dependent variability in uncertainty, whereas the conformal calibration step provides a bounded adjustment to ensure coverage guarantees. These results demonstrate that combining learned, structure-aware quantile estimation with conformal calibration enables rigorous uncertainty quantification for molecular docking at atom-level resolution.
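
The combination of quantile estimation with split conformal calibration can be sketched generically. The code below is not ConfDock's implementation: it assumes that some model (a GNN in the abstract, arbitrary here) has already produced lower and upper quantile predictions on a held-out calibration set, and it applies the standard conformalized-quantile-regression correction. The function names and the toy numbers are illustrative assumptions.

```python
import math


def conformal_adjustment(cal_lo, cal_hi, cal_y, alpha=0.1):
    """Split-conformal margin for quantile intervals (CQR-style).

    cal_lo / cal_hi: predicted lower / upper quantiles on calibration data.
    cal_y: observed values. Returns the margin q to add on both sides of
    new intervals to target (1 - alpha) marginal coverage.
    """
    # Conformity score: distance by which the truth falls outside the
    # predicted interval (negative when it lies inside).
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    # Finite-sample rank ceil((n + 1) * (1 - alpha)); the small tolerance
    # guards against floating-point round-up.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def conformal_interval(lo, hi, q):
    """Widen (or shrink, if q < 0) a predicted interval by the margin q."""
    return lo - q, hi + q


# Toy calibration set: every predicted interval is [0, 1].
ys = [0.5, 0.2, 1.3, 0.9, -0.1, 0.7, 1.1, 0.4, 0.6]
q = conformal_adjustment([0.0] * 9, [1.0] * 9, ys, alpha=0.2)
print(conformal_interval(0.0, 1.0, q))
```

For atom-specific intervals, one would presumably calibrate on per-atom errors (for example, per-atom deviation from the reference pose); that detail is an assumption, and the coverage guarantee holds only when calibration and test data are exchangeable.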

8

StructureSAFE: A structure-aware chemical language model for unified hit identification and lead optimization

Yang, B.; Xu, K.; Xiang, C.; Lee, B.; Xu, Y.; Li, T.; Shi, Y.; Sinitskiy, A.; Li, J.

2026-07-02 bioinformatics 10.64898/2026.06.28.735128 medRxiv

Top 0.1%

6.3%

Structure-based generative models (SBGMs) hold great promise for accelerating drug discovery by enabling target-aware molecular design. However, existing approaches face fundamental challenges: three-dimensional graph-based models can explicitly incorporate protein structural information but often generate chemically implausible molecules due to limited training data, while chemical language models (CLMs) produce chemically plausible molecules but struggle to effectively leverage three-dimensional structural information for structure-conditioned generation and are hard to extend with lead optimization functionality because of the nature of SMILES strings. Here, we present StructureSAFE, a structure-aware chemical language model that resolves this trade-off by integrating protein structural and evolutionary encoders with the SAFE molecular representation via a pretraining and fine-tuning scheme, enabling both de novo hit identification and a comprehensive suite of lead optimization subtasks within a unified framework. Comprehensive benchmarking on the MolGenBench dataset demonstrates that StructureSAFE achieves state-of-the-art (SOTA) performance across multiple metrics, with particularly pronounced improvements in chemical plausibility relative to graph-based models lacking pretraining. Evaluation on a rigorously constructed held-out test set further confirms its ability to generate drug-like, synthetically accessible molecules with competitive predicted binding affinities for previously unseen targets in both hit identification and lead optimization settings. In silico case studies across four therapeutically relevant targets validate its capacity to generate chemically plausible molecules that recapitulate key binding interactions of known high-affinity ligands while proposing novel interactions for potentially better affinity and exploring previously unknown regions of chemical space.
Taken together, StructureSAFE represents a versatile and practical tool to provide high-quality candidate molecules for augmenting medicinal chemistry workflows in both hit identification and lead optimization campaigns.

9

ADMET Property Prediction with Quantum-Inspired Preprocessing

Mansour, B.; Rafaelyan, G.

2026-07-05 bioinformatics 10.64898/2026.06.30.735582 medRxiv

Top 0.1%

5.4%

Accurate prediction of Absorption, Distribution, Metabolism, Excretion, and Toxicity (ADMET) properties is a central challenge in early-stage drug discovery, where experimental determination remains costly and time-consuming. In this work, we propose a quantum-inspired preprocessing framework in which statistical dependencies among molecular descriptors are encoded into a parameterised many-body Hamiltonian, and the expectation values obtained by simulating its time evolution serve as additional inputs to a gradient-boosted ensemble model (CatBoost). Mutual information (MI) is used both to select the most informative descriptors and to set the coupling strengths of the Hamiltonian, so that the induced entanglement structure reflects empirically measured feature correlations; the evolution is realised with a short digitised-counterdiabatic schedule that generates a compact set of expectation-value features while keeping the circuit shallow. Before training, the resulting quantum-derived feature vectors are concatenated with the full MapLight descriptor set (concatenated ECFP, Avalon, and ErG fingerprints together with RDKit physicochemical properties). We evaluate the pipeline on the AqSolDB aqueous solubility benchmark from the Therapeutics Data Commons (TDC) platform, achieving a mean absolute error (MAE) of 0.746 ± 0.006 log(mol/L), which is within the reported error bars of the current top-performing model on the TDC leaderboard (MAE = 0.741 ± 0.013). Ablation experiments show that the quantum-derived features match classical second-degree polynomial interaction features derived from the same MI-selected subset, while forming a far more compact representation (85 quantum features versus up to 4,950 polynomial terms, an approximately 58-fold reduction). SHapley Additive exPlanations (SHAP) analysis identifies the physicochemical drivers of solubility predictions, offering interpretable insight into model behaviour.
These results demonstrate that MI-guided Hamiltonian feature extraction can reproduce the performance of strong classical interaction models on aqueous solubility while generating a compact, interpretable feature representation that is compatible with future quantum execution.
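
The feature-count comparison in this abstract can be checked with a few lines of arithmetic. The figures 85 and 4,950 are taken from the abstract; the size of the MI-selected subset is not stated, so the value of 100 descriptors below is an assumption, chosen because the number of pairwise products of 100 features happens to equal 4,950.

```python
import math

n_quantum = 85        # quantum-derived features, as stated in the abstract
n_polynomial = 4950   # upper bound on polynomial terms, as stated

# Assumption (not in the abstract): the MI-selected subset has 100
# descriptors, and only pairwise interaction terms are counted.
assumed_subset = 100
pairwise_terms = math.comb(assumed_subset, 2)

reduction = n_polynomial / n_quantum
print(pairwise_terms, round(reduction, 1))  # 4950 58.2
```

Other readings give nearly the same count (for example, all terms of degree at most two over 98 descriptors, including the constant), so the subset size should be treated as unconfirmed; the roughly 58-fold reduction does not depend on it.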

10

BioMetAll v2.0: Introducing Scores, Metal Discrimination, and Side-Chain Descriptors for Predicting Metal-Binding Sites in Proteins

Marechal, J. D.; Fernandez Diaz, R.; Pena Losada, R.; Sanchez Aparicio, J. E.; Gao, W.; Alemany, M.

2026-07-12 bioinformatics 10.64898/2026.07.09.737562 medRxiv

Top 0.1%

5.3%

Predicting the location of metal-binding sites in proteins is crucial for fundamental biological questions and biotechnological applications. Over the past decade, the rise in metal-bound protein structures in the Protein Data Bank, combined with advanced statistical models such as deep learning, has accelerated the development of metal-binding site prediction tools. Several approaches are now available, offering high-quality benchmarks and predictive performance. Our initial development in this area is BioMetAll, whose first version was based on backbone pre-organization. Here, we introduce its second version, featuring two major updates: 1) metal-specific scoring functions and 2) prediction using backbone geometry alone or in combination with first coordination sphere descriptors. Apart from demonstrating metal sensitivity and yielding better benchmarking results, this new version allows the assessment of how considering the metal's first coordination sphere, versus backbone pre-organization alone, influences how metallic species bind to proteins.

11

AptCancerDB: A Curated Knowledgebase and Translational Discovery Platform for Anticancer Aptamers

Bajiya, N.; Singh, S.; Raghava, G. P. S.

2026-07-09 cancer biology 10.64898/2026.07.02.735999 medRxiv

Top 0.1%

4.9%

Aptamers are emerging as important molecular recognition ligands in oncology, playing significant roles in cancer diagnostics, targeted therapies, drug delivery systems, and molecular imaging. Numerous aptamers have advanced to clinical trials, indicating their potential for real-world applications; however, existing databases fail to capture that. To bridge this critical gap, we developed AptCancerDB (https://webs.iiitd.edu.in/raghava/aptcancerdb/), a comprehensive, manually curated database of experimentally verified anticancer aptamers. The current release contains 1,941 entries collected from studies published between 2000 and 2025, covering 29 cancer types, approximately 200 cancer cell lines, and direct links to 22 clinical trials. Each entry is annotated with sequence information, target details, cancer type, cell line, SELEX methodology, affinity determination data, chemical modifications, and biological activities. The dataset is dominated by 82.7% ssDNA, reflecting its superior stability and ease of synthesis, while only 16.6% is ssRNA and appears primarily in studies targeting complex intracellular or protein-protein interactions. To facilitate structural analysis, predicted secondary structures, dot-bracket notations, specific structural elements, and minimum free energy values were also included. AptCancerDB integrates a MySQL backend with an ArcadeDB/OpenCypher-based Knowledge Graph, enabling exploration of relationships among aptamers, targets, cancer types, cell lines, and functional applications. The platform provides advanced search and browsing facilities, BLASTn-based similarity searching, and GC Calculator. Built on a modern, responsive frontend (React/TypeScript/Tailwind CSS), the platform includes a REST API for data retrieval. 
By integrating fragmented experimental data into a unified cancer-focused resource, AptCancerDB serves as a valuable resource for comparative analysis, aptamer discovery, and the development of next-generation aptamer-based diagnostics and therapeutics. Highlights: Curated knowledge base of experimentally validated anticancer aptamers. AptCancerDB contains therapeutic, tumor-homing and cell-penetrating aptamers. Summarizes clinical progress and translational trends in anticancer aptamer research. Supports rational aptamer design using molecular, functional, and clinical annotations. Disease-focused resource for cancer diagnosis, therapy, and drug delivery. Teaser: AptCancerDB maintains experimentally validated anticancer aptamers relevant to diagnosis, drug delivery, and therapy.
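
For readers unfamiliar with what a GC calculator computes, a minimal sketch follows. It is not AptCancerDB's implementation; the function name and its handling of case and whitespace are assumptions made for illustration.

```python
def gc_content(sequence):
    """Percentage of G and C bases in a DNA or RNA sequence."""
    # Normalise case and drop spaces and line breaks before counting.
    seq = "".join(sequence.split()).upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


print(gc_content("GGCCAATT"))  # 50.0
```

The same function works for ssRNA aptamers, since U is simply counted as a non-GC base.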

12

Pharmacological Stratification of Public Bioactivity Databases: A Reusable, OECD-Anchored Curation and Benchmarking Framework Demonstrated for Opioid Receptors

Nael, M.; Alakonda, L.; Ghosh, A.; Ward, S. J.; Liu-Chen, L.-Y.; Rajadhyaksha, A. M.; Abou-Gharbia, M.; Elokely, K. M.

2026-06-24 bioinformatics 10.64898/2026.06.18.732083 medRxiv

Top 0.2%

4.4%

Public bioactivity databases are heterogeneous not only in measurement type, where binding affinities and functional potencies are reported on different scales, but in pharmacology: the same compound and target can carry agonist, antagonist, or inhibitor records measured through binding displacement, cAMP, β-arrestin, or [35S]GTPγS readouts that quantify different biological events. Pooling these records produces models whose output is detached from any coherent pharmacological claim. Prior work has standardized bioactivity at scale and quantified the noise from mixing measurement types, but pharmacological mechanism and assay-readout class have not been treated as a primary axis of large-scale curation. This study presents an auditable, OECD-anchored framework that stratifies public records by action type and assay readout before modeling, converting heterogeneous data into externally validated, interpretable QSAR tasks that compose with existing standardization resources rather than replacing them. The framework is demonstrated on the four opioid receptors (MOR, DOR, KOR, and nociceptin/orphanin FQ, NOP). Four public sources were reconciled into 72,148 merged records and 50,977 curated measurements spanning 19,585 compounds, each carrying auditable attributes for source agreement, endpoint meaning, pharmacology class, assay readout, and trust tier. Receptor-level binding tasks formed a compact benchmark with strong locked external performance, including KOR pK (R2 = 0.79, n = 798) and DOR pK (R2 = 0.77, n = 736). Pharmacology- and readout-resolved functional endpoints yielded externally validated strata that pooled labels would obscure, including a MOR antagonist functional-inhibition endpoint (R2 = 0.86, n = 110) and agonist potency endpoints for DOR, KOR, and MOR (R2 up to 0.81).
Comparison against a fully pooled baseline shows that pooled models either match stratified models on coherent endpoints or reach a deceptively high R2 on functional-IC50 endpoints by training predominantly on binding-displacement records, so the pooled number predicts affinity rather than functional activity. SHAP attribution indicates that binding and functional potency encode partially distinct structure-activity signals. The dataset contract, not model performance alone, defines the validity and scope of a QSAR claim, and stratification is a precondition for a functional model to support a defensible claim. Curation logic, derived tables, frozen data, and reproducibility artifacts are released.

13

BoltzProt-1: Towards Efficient De Novo Binder Design with Good Developability

Ucar, T.; Bates, J.; Fu, Y.; Shi, W.; Stark, H.; Nava, D.; Cavalleri, L.; Wohlwend, J.; Corso, G.; Passaro, S.

2026-06-27 bioinformatics 10.64898/2026.06.23.733997 medRxiv

Top 0.2%

4.0%

Designing binders against novel protein targets remains a central challenge in computational drug discovery. Here we introduce BoltzProt-1, a pipeline for generating protein binders, including nanobodies, with improved hit rates and favorable developability properties. At its core lie a refined iteration of BoltzGen's generative model and a novel protein-protein interaction prediction model, BoltzPPI. Employing BoltzPPI instead of BoltzGen's standard structure-prediction confidence metrics to rank nanobody (VHH) designs increases the confirmed-binder hit rate from 3.3% to 8.0% across 10 novel targets. Assessed on 10 additional targets used in prior literature, the BoltzProt-1 pipeline obtains nanobody screening hits for 7 of 10 targets, surpassing the 6 of 10 previously reported by Chai-2. Finally, evaluating the developability of BoltzProt-1-designed nanobodies in terms of stability, aggregation, purity, polyspecificity and hydrophobicity reveals that 58% of its confirmed binders pass every criterion, exceeding both BoltzGen (40%) and clinical-stage VHH controls (21%).

14

EnzyKAN: Protein Language Model Embeddings and Kolmogorov-Arnold Network Variants for Enzyme Commission Classification with a Proposed Electron-Transfer Physics Feature Framework

R, S.; Reddy, B. R. R.

2026-06-29 bioinformatics 10.64898/2026.06.23.734004 medRxiv

Top 0.2%

3.9%

Motivation: Computational enzyme classification has previously utilised sequence homology features and protein language model embeddings. The Kolmogorov-Arnold Network (KAN) paradigm, which uses learnable edge functions rather than fixed ones, has shown promising results in biological sequence tasks. Results: We report a fully reproducible investigation of KAN variants for seven-class EC classification on up to 9,516 labelled sequences from the CLEAN benchmark [1] (9,386 for language model experiments). In the sequence-only setting, fixed-basis KAN variants moderately outperformed an MLP baseline (macro F1 = 0.17-0.29). Using ESM-2 650M embeddings [2] greatly improved results under 5-fold cross-validation: MLP macro F1 = 0.750 ± 0.009, accuracy = 0.823 ± 0.009; learnable SineKAN macro F1 = 0.716 ± 0.023, accuracy = 0.788 ± 0.019. MLP performed comparably but did not exceed conventional baselines. As an aside, we introduce but do not investigate an approach to EC oxidoreductase sub-classification through the use of a Marcus theory-based electron transfer feature framework. Availability: Code and result files are available at https://github.com/sanjuz-cas/ENZYKAN.

15

Identifying and Addressing Systematic Data Leakage in Protein-Ligand Affinity Benchmarks

Mattsson, B.; Walters, W.

2026-06-30 molecular biology 10.64898/2026.06.29.735309 medRxiv

Top 0.2%

3.4%

Accurate prediction of protein-ligand binding affinity is a crucial goal in structure-based drug discovery, with the potential to significantly shorten development timelines. Recently, a new wave of machine learning models based on co-folding, such as Boltz-2 and IsoDDE, has demonstrated performance that matches or exceeds that of gold-standard physics-based methods like Free Energy Perturbation (FEP). This paper provides a critical assessment of these claims, revealing that current benchmarks are heavily influenced by data leakage, and proposes a new benchmark that explicitly controls for data leakage. We demonstrate that splitting by protein-sequence identity is inherently insufficient to prevent data leakage due to "target mirroring," in which homologous proteins with low overall sequence identity still exhibit highly correlated binding profiles. Our meta-analysis of documents in the ChEMBL 36 database identifies more than 6,000 such assay pairs and finds that leakage persists for sequence-identity thresholds as low as 0.2, well below the values commonly used in benchmarks today. Additionally, we show that a ligand-only baseline model, which lacks protein structural information, achieves surprisingly high performance on the FEP+ 4 and OpenFE benchmarks (r = 0.66 and r = 0.36, respectively). Our results indicate that current benchmarks tend to reward models for memorizing training data and exploiting localized leakage rather than truly learning biophysical principles. To address this issue, we propose the Novelty-Tiered Affinity Benchmark, in which the test data is partitioned into ligand novelty tiers. In the most challenging tier (Tanimoto similarity < 0.35), ligand-only models perform notably worse (r = 0.14), offering a clear baseline for evaluating genuine generalization. We argue that the field must move beyond sequence-based splits to ensure that AI-driven discovery translates into successful prospective laboratory research.
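
The idea of partitioning test ligands into novelty tiers can be sketched with a plain Tanimoto computation over fingerprint bit sets. This is an illustrative assumption about how such tiers might be assigned, not the authors' benchmark code: only the 0.35 cutoff comes from the abstract, while the second cutoff (0.6), the function names, and the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def novelty_tier(test_fp, train_fps, cutoffs=(0.35, 0.6)):
    """Assign a tier from the maximum similarity to any training ligand.

    Tier 0 is the most novel (max similarity < cutoffs[0]). Only the 0.35
    cutoff is taken from the abstract; 0.6 is a placeholder.
    """
    nearest = max((tanimoto(test_fp, fp) for fp in train_fps), default=0.0)
    for tier, cutoff in enumerate(cutoffs):
        if nearest < cutoff:
            return tier, nearest
    return len(cutoffs), nearest


# Toy fingerprints as sets of "on" bit indices (hypothetical data).
train = [{1, 2, 3, 4}, {10, 11, 12}]
for fp in ({1, 20, 21, 22}, {1, 2, 5}, {1, 2, 3}):
    print(sorted(fp), novelty_tier(fp, train))
```

In practice the bit sets would come from a cheminformatics toolkit's fingerprint routine; performance is then reported per tier so that a ligand-only baseline can be compared against structure-aware models on the most novel tier.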

16

Acquiring Improved Protein Variants With Probabilistic Preferential Learning

van der Flier, F. J.; de Ridder, D.; Probst, D.; Redestig, H.

2026-06-26 bioinformatics 10.64898/2026.06.22.733688 medRxiv

Top 0.2%

3.3%

Variant effect prediction (VEP) models can be used to select promising novel enzymes from a pool of candidates. Most supervised VEP models are framed as regression tasks, placing more emphasis on getting the predicted quantities correct than on the relative comparison of individual candidates. Preferential or contrastive models may better align with the goal of selection, or acquisition, especially when informed by predictive uncertainty. Here, we introduce a probabilistic preferential learning model based on the Kermut Gaussian process (PKermut) that we designed with the ambition to increase the hit rate among selected variants. We benchmark PKermut against established models, including the original Kermut, the RITA regressor, and an augmented Potts model, on 69 curated ProteinGym datasets across various assay categories. To evaluate acquisition performance, we propose a novel quantile cross-validation scheme that ensures the evaluation of a model's ability to extrapolate by reserving high-performing variants exclusively for the test set. We assess models using Spearman correlation and evaluate their acquisition performance using five different acquisition functions, encompassing both uncertainty-aware and unaware strategies. Our experimental results indicate that uncertainty estimates improve the acquisition ability of our models, and that strategies that reward uncertainty generally result in better outcomes than those that do not on single-mutation variant datasets. We observe that PKermut's Spearman scores and ability to acquire improved variants are greatly affected by the number of variant comparisons sampled in the training set. Kermut achieves the highest Spearman correlation in 54/69 datasets (78%), compared to 12/69 (17%) for PKermut. For acquisition performance, Kermut leads in 44/69 datasets (64%), while PKermut leads in 15/69 (22%).
While at this stage PKermut is not a recommended alternative to Kermut, its contrastive nature offers several conceptual opportunities. We share our findings to inspire further development aimed at improving the alignment between training objectives of VEP models and their downstream application in protein engineering.

17

The Hidden Disorder Divide: Reconciling Benchmark Inconsistencies in Intrinsically Disordered Protein Binding Site Prediction

Malhis, N.; Mehdiabadi, M.; Erdos, G.; Gsponer, J.; Kurgan, L.; Tosatto, S. C. E.; Dosztanyi, Z.; Piovesan, D.

2026-06-27 bioinformatics 10.64898/2026.06.24.733783 medRxiv

Top 0.2%

3.3%

Computational predictors of protein-binding sites within intrinsically disordered regions (IDRs) show highly inconsistent performance across high-quality benchmark datasets. To understand the origins of these discrepancies, we systematically compared predictors across three independent test sets: two CAID datasets updated with the latest DisProt annotations and a composite dataset (DBs) assembled from DIBS, FuzDB, IDEAL, and MFIB. Predictors trained predominantly on DisProt data achieved substantially higher AUCs on the CAID sets but performed poorly on the DBs. In contrast, predictors trained on older, low-quality PDB-based datasets showed balanced performance across all sets, with a slight preference for DBs. Predictors with mixed training exposure displayed intermediate behavior. Through controlled experiments using identical CNN architectures and feature analysis, we demonstrate that the dominant factor driving these performance differences is the intrinsic disorder propensity of the binding sites themselves. Binding residues in DisProt-based datasets exhibit markedly higher average disorder propensity scores than those in PDB-derived datasets. This previously unrecognized selection bias -- literature studies preferentially characterizing more disordered binding sites, while PDB-derived annotations capture less disordered ones -- effectively splits IDR-protein binding sites into two distinct categories. Predictors optimized on one category therefore generalize poorly to the other. Binding-site length and sequence conservation play only minor or negligible roles in explaining the observed inconsistencies. These findings highlight a critical limitation in current benchmarking practices and training strategies for IDR-binding site prediction, underscoring the need for more balanced and disorder-aware reference datasets. Finally, the diagnostic techniques introduced here could prove valuable beyond the specific application examined in this study.

18

Deep-Interact Studio: An Interactive Deep Learning Model Building Platform for Biomolecular Interaction Prediction

Sarkar, D.; Bardhan, K.; Sarkar, C.

2026-07-07 bioinformatics 10.64898/2026.07.02.736034 medRxiv

Top 0.2%

3.1%

Motivation: Deep learning has rapidly become essential for predicting biomolecular interactions; however, most web tools expose only a single, pre-built model with a fixed, non-configurable architecture that users cannot redesign, retrain on their own data, or compare; they are typically dedicated to one interaction type and often one species, and report prediction scores with little interpretability. These constraints force researchers to work across several disconnected, single-purpose tools and limit the flexibility, reproducibility, and long-term usability of existing platforms. Results: We present Deep-Interact Studio, a unified, web-based deep-learning platform that addresses these limitations by shifting interaction prediction from a model-centric to a user-driven, comparative, and interpretable paradigm. Within a single interface spanning all four interaction classes, namely protein-protein, drug-target, RNA-protein, and protein-DNA, users design their own model architectures layer by layer, configure training hyperparameters, and train them on their own data, including custom, species-specific datasets. Multiple user-built models can then be trained under identical conditions and compared side by side at both the training and inference levels, while integrated interpretability, including SHAP-based feature attribution, embedding-space visualization, and interaction hub analysis, turns predictions into auditable, mechanistically grounded results. Deep-Interact Studio is, to our knowledge, the only such platform to combine fine-grained per-layer model customization with multi-model comparison and interpretability, offering a flexible and transparent alternative to fixed, single-purpose tools.

19

PEPstrMOD2: Next-generation tertiary structure prediction of chemically modified and non-natural peptides

Jain, S.; Mehta, N. K.; Raina, S.; Kumar, P.; Varun; Raghava, G. P. S.

2026-07-06 bioinformatics 10.64898/2026.06.22.733733 medRxiv

Top 0.2%

3.1%

While most existing methods are limited to predicting the tertiary structures of proteins containing only canonical residues, the PEPstrMOD server (developed in 2015) pioneered structure prediction for chemically modified and non-natural peptides. Despite its widespread use, the original framework was restricted to peptides of 7 to 25 residues and relied on older backbone-prediction algorithms. To address these limitations, we present PEPstrMOD2, which introduces three major advancements over its predecessor. First, it replaces the original in-house coordinate generation with state-of-the-art deep learning (DL) algorithms, leveraging AlphaFold2 and ESMFold for highly accurate initial structure prediction. Second, it greatly expands the accessible chemical space by incorporating a new, AMBER force-field-compatible library of 257 post-translational modifications (PTMs), 428 non-canonical amino acids (NCAAs), and 243 terminal modifications. Lastly, by exploiting the native scalability of AlphaFold2 (AF2) and ESMFold (EF), PEPstrMOD2 removes the original length restrictions, enabling the structural modeling of longer, complex therapeutic peptides and small proteins. We evaluated the performance of PEPstrMOD2 against state-of-the-art methods across three distinct peptide datasets. For the AfCyc dataset consisting of 80 cyclic peptides, PEPstrMOD2 obtained a competitive average atom-level Root Mean Square Deviation (RMSD) of 2.05 angstroms, compared to 1.13 angstroms for AlphaFold3 (AF3) and 1.82 angstroms for AfCycDesign. Remarkably, for the modified peptide ModPep433 dataset, PEPstrMOD2 outperformed AF3, achieving a lower average RMSD of 4.49 angstroms versus 4.67 angstroms for AF3. Furthermore, on the ModPep16 benchmark, PEPstrMOD2 achieved an average RMSD of 2.50 angstroms, less than half that of the original PEPstrMOD (5.84 angstroms).
In summary, PEPstrMOD2 provides a powerful, high-throughput, and highly accurate platform to facilitate peptide-based drug development and structural biology research. While the original PEPstrMOD was restricted to a web server interface, PEPstrMOD2 is available as both an intuitive web server and a standalone command-line tool via GitHub, featuring Docker support for easy deployment and reproducible, large-scale modeling pipelines (https://webs.iiitd.edu.in/raghava/pepstrmod/).
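
For readers unfamiliar with the metric, the sketch below shows how an atom-level RMSD between a predicted and a reference structure can be computed. It assumes the two coordinate lists are already superimposed and list the same atoms in the same order; the abstract does not say how PEPstrMOD2 aligns structures or selects atoms, so this is not the authors' evaluation code. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(predicted: Sequence[Coord], reference: Sequence[Coord]) -> float:
    """Root mean square deviation over matched atoms.

    Assumption: structures are pre-aligned; no superposition is done here.
    """
    if len(predicted) != len(reference):
        raise ValueError("structures must contain the same number of atoms")
    if not predicted:
        raise ValueError("at least one atom is required")
    squared = sum(
        (xp - xr) ** 2 + (yp - yr) ** 2 + (zp - zr) ** 2
        for (xp, yp, zp), (xr, yr, zr) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))


def fold_improvement(old_rmsd: float, new_rmsd: float) -> float:
    """Ratio of an older average RMSD to a newer one (values above 1 mean lower error)."""
    if new_rmsd <= 0:
        raise ValueError("new_rmsd must be positive")
    return old_rmsd / new_rmsd
```

Applied to the ModPep16 averages quoted in the abstract, `fold_improvement(5.84, 2.50)` gives a ratio of about 2.3.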

20

BoltzMol-1: Towards Reliable Virtual Screening for Fast and Cost-Effective Hit Discovery

Getz, N.; Smith, G.; Colgan, A.; Fan, V.; Cavalleri, L.; Capponi, F.; Wohlwend, J.; Gitter, A.; Kritzer, J.; Maiorano, M.; Wlodarchak, N.; Corso, G.; Passaro, S.

2026-07-06 biochemistry 10.64898/2026.07.04.736485 medRxiv

Top 0.3%

2.7%

We present BoltzMol-1, a small-molecule hit discovery pipeline centered on an optimized version of Boltz-2 and explicitly adapted for prospective discovery. Reliable hit discovery that generalizes across target classes (rather than only the well-characterized families that dominate existing ligand data) would broaden the range of biology accessible to small-molecule intervention and reduce reliance on resource-intensive high-throughput screening. Towards this goal, the system prioritizes compounds for rapid experimental validation by coupling model-driven ranking with streamlined procurement from commercial catalogs. To improve developability at the point of selection, we introduce a suite of ADMET models for kinetic solubility (logS), lipophilicity (logD), and Caco-2 permeability. These models act as an early triage layer, systematically filtering out compounds with unfavorable physicochemical and absorption properties prior to synthesis or purchase. Across a panel of ten targets (most with no representation in the underlying affinity training data), we observe strong prospective performance on challenging systems. Functional actives or binders were identified for 6 of 10 targets, despite modest experimental budgets of 28-96 compounds per target. These results include successes on receptors and enzymes traditionally considered difficult for structure- or ligand-based approaches. Collectively, this work establishes a practical framework for low-throughput, cost-constrained discovery campaigns capable of delivering chemically tractable binders with favorable property profiles.
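
The triage-then-rank flow described above could look roughly like the sketch below. Everything in it is hypothetical: the `Candidate` and `TriageLimits` types, the field names, the direction of each cutoff, and the example threshold values are illustrative assumptions, since the abstract gives no thresholds or interfaces for BoltzMol-1.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    rank_score: float  # model-driven ranking score; higher is assumed better
    log_s: float       # predicted kinetic solubility (logS)
    log_d: float       # predicted lipophilicity (logD)
    caco2: float       # predicted Caco-2 permeability, arbitrary units


@dataclass(frozen=True)
class TriageLimits:
    # Hypothetical cutoffs supplied by the user; not values from the preprint.
    min_log_s: float
    max_log_d: float
    min_caco2: float


def triage_and_select(candidates: Sequence[Candidate],
                      limits: TriageLimits,
                      budget: int) -> List[Candidate]:
    """Drop candidates failing any property cutoff, then keep the top-ranked ones."""
    if budget < 0:
        raise ValueError("budget must be non-negative")
    passing = [
        c for c in candidates
        if c.log_s >= limits.min_log_s
        and c.log_d <= limits.max_log_d
        and c.caco2 >= limits.min_caco2
    ]
    passing.sort(key=lambda c: c.rank_score, reverse=True)
    return passing[:budget]


if __name__ == "__main__":
    pool = [
        Candidate("cmpd-A", 0.91, log_s=-3.0, log_d=2.1, caco2=12.0),
        Candidate("cmpd-B", 0.95, log_s=-6.5, log_d=4.8, caco2=1.0),
        Candidate("cmpd-C", 0.80, log_s=-4.0, log_d=3.0, caco2=8.0),
    ]
    limits = TriageLimits(min_log_s=-5.0, max_log_d=4.0, min_caco2=5.0)
    for chosen in triage_and_select(pool, limits, budget=2):
        print(chosen.name)
```

Filtering before ranking means a compound with a high ranking score but poor predicted properties (here `cmpd-B`) never consumes part of the purchase budget, which in the abstract ranges from 28 to 96 compounds per target.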